Method and arrangement for bringing together data on parallel data paths

ABSTRACT

A processor arrangement having a strip structure for parallel data processing is configured so that local data from the individual processing units or strips is brought together in a rapid manner. Input data, intermediate data and/or output data from various processing units are linked together in an operation which is at least partially combinatory. The data linking operation is not clock controlled. The linking of the local data from various strips in this manner reduces delays in parallel data processing in the processor arrangement. The combinatory data linking operation can provide an overall data linking outcome within an individual clock cycle.

CROSS REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of International Patent Application No. PCT/DE03/00417 filed Feb. 12, 2003, which claims priority to German Patent Application No. 102 06 830.5 filed Feb. 18, 2002, both of which applications are hereby incorporated by reference in their entireties herein.

FIELD OF THE INVENTION

The invention relates to data flow in parallel data processing arrangements. In particular, the invention relates to systems and methods for splitting data for parallel processing and recombining processed split data.

BACKGROUND OF THE INVENTION

Processors for parallel data processing have long been known. A characteristic of common parallel processor architecture is the provision of a plurality of processing units, by which parallel processing of data can be accomplished. Such an architecture and an associated processing method are described, for example, in German Letters of Disclosure DE 198 35 216. This German Letters of Disclosure describes data in a data memory being split into data groups with a plurality of elements and stored under one and the same address. Each element of a data group is assigned to a processing unit. All data elements are simultaneously read out of the data memory in parallel and distributed as input data to one or more processing units, where they are processed in parallel under clock control. The parallel processing units are connected together via a communication unit. A processing unit comprises at least one process unit and one storage unit, arranged in a strip. Each strip is generally adjacent to at least one additional strip of like structure.

Such processor units may be referred to as Single Instruction Multiple Data (SIMD) vector processors. In SIMD processors, the respective data elements are processed in the parallel data paths (i.e., strips) as described above. Depending upon the program to be processed, the partial results may be written in the group memory as corresponding data elements or as data groups. Under some circumstances, however, it may be necessary to bring together processed data from parallel data paths. For example, in the performance of an algorithm on the vector processor, it may be necessary to link data calculated locally in a plurality of strips, or alternatively in all strips, into a global intermediate result. For this purpose, in the prior art, the partial results of the strips have been combined with the aid of a program over a plurality of clock cycles in order to obtain the desired intermediate result. If this global intermediate result is required for subsequent calculations of the algorithm, calculation of the end result is delayed.

Consideration is now being given to improved parallel processing methods and arrangements. The desirable processing methods and arrangements achieve higher processing speeds, for example, by incorporating processor functionality that permits local data from individual data strips to be linked without requiring a great expenditure of time.

SUMMARY OF THE INVENTION

Parallel processing methods and processor arrangements are provided for achieving high data processing speeds. The inventive methods and processor arrangements are configured so that input, intermediate and/or output data of a variety of processing units can be linked via at least one section-wise combinatorial operation, which is not a clock-controlled operation.

With the provision of a combinatorial linking operation that is not clock-controlled, data elements or groups assigned to a variety of processing units can be brought together in a surprisingly quick and simple manner. With the combinatorial linking operation, the necessary linkage of data does not significantly delay parallel data processing in the processor arrangement. In particular, it may be possible with the combinatorial operation to provide the total result of data linkage within a single clock cycle. This feature is especially advantageous when data from all processing units are linked by the linking operation in order to provide accelerated processing of specific algorithms that run on the processor arrangement.

The combinatorial linking operation may be deployed in logical and/or arithmetic operations. Thus, all possible linkages of data from a variety of processing units and parallel data paths can be obtained in accordance with the principles of the present invention.

In an implementation of the invention, the combinatorial linking operation may involve a redundant numeric representation in at least one partial step. In particular, in the performance of arithmetic operations such as addition or subtraction, carries at all positions of the data can be formed simultaneously and used for the next partial step. There is no need to propagate the carry from position to position within the partial step of the operation, which would delay the processing for subsequent positions. Thus, the carry vector can propagate almost as rapidly as the sum vector within a partial step. Any delay due to a “ripple” effect can occur only in the last partial step of the operation, in which the sum and carry vectors are brought together.
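The redundant representation described above corresponds to what is commonly called carry-save addition. The following is a minimal sketch of one such partial step and of the final merge; the function names `carry_save_step` and `ripple_merge` are illustrative assumptions and do not appear in the specification.

```python
def carry_save_step(a, b, c):
    """Compress three operands into a sum vector and a carry vector.

    Each bit position is handled independently (one full adder per
    position), so no carry ripples from position to position here.
    Names are hypothetical; this is a behavioral sketch only.
    """
    sum_vec = a ^ b ^ c                              # per-position sum bits
    carry_vec = ((a & b) | (a & c) | (b & c)) << 1   # per-position carries, one weight higher
    return sum_vec, carry_vec


def ripple_merge(sum_vec, carry_vec):
    """Last partial step: bring the sum and carry vectors together.

    Only here does a carry have to travel across positions.
    """
    return sum_vec + carry_vec


s, c = carry_save_step(0b1011, 0b0110, 0b1101)
total = ripple_merge(s, c)
```

The pair `(s, c)` is the redundant form of the result: it represents the same value as `total`, but is available without any position waiting on its neighbor.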

In order to meet all possible requirements of any desired algorithm, a single data element or alternatively a data group is produced as the result of linkage of the local data in accordance with the principles of the present invention. Any desired data sources from the various strips can be linked together and the result fed to any desired data sinks of the processor arrangement.

In an advantageous implementation of the invention, the result of the linking operations can be fed back to a processing unit, which enables, for example, recursive algorithms to be performed more rapidly.

In another implementation of the invention utilizing a plurality of installed combinatorial operations that are not clock-controlled, a single one can be selected for bringing together data. In this way, a plurality of algorithms and/or complex algorithms with a plurality of various assemblies of local data can be implemented in a processor arrangement.

The inventive processor arrangements include at least a section-wise combinatorial linkage arrangement, which is not a clock-controlled linkage arrangement. The combinatorial linkage arrangement can link together data from a variety of strips, and in particular can link together input, intermediate and/or output data of a variety of processing units. The inventive processor arrangements permit the assembly of local data from a variety of strips required for certain algorithms to be performed more rapidly than in the prior art. Delays that occur in conventional parallel data processing are avoided.

In a version of the inventive processor arrangements, the linkage arrangement may comprise an addition network, subtraction network and/or a network for minimum/maximum formation. Such networks are capable of ascertaining the carry that results at a position of the data in a specific step of the operation in the logic arrangement, independently of the results of preceding positions or steps. The linkage arrangement may be designed so that carries occurring in all or almost all partial steps are not used for the calculation of subsequent positions. Thus, the known delay occurs only in the part of the linkage network in which a sum vector and a carry vector are brought together.

In a linkage network for minimum/maximum formation via a plurality of strips, it may be advantageous to pass on between calculation steps an index, which represents or indicates the strip with an extreme data value, in addition to the extreme data value itself.

In accordance with the principles of the invention, a processor arrangement for linkage of a wide variety of data within a data path or strip with data from other data paths or strips may include a plurality of selectable linkage arrangements of various types. The selection of appropriate linkage arrangements of various types may be program controlled. By suitable selection of the linking arrangements, it may be possible to link the same data variably, logically or arithmetically.

A deployed linking arrangement may be configured so that its output may be connected with any desired registers of the processor arrangement. The linking arrangement output may, for example, be connected to a register of a processing unit, or alternatively to a global register in which a data group is capable of being filed or stored.

In order to avoid unnecessary energy consumption in the processor circuits, at least one input register of the linkage arrangement may be assigned a control mechanism or switch which can be operated to separate the input register and hence its data from the linkage arrangement. Since the linkage arrangement operates at least section-wise combinatorially in a manner that is not clock-controlled, changes at its data input would otherwise take effect in the linkage arrangement automatically, even when a linkage is not needed or not all input data are yet present at the given moment. The control mechanism prevents this.

BRIEF DESCRIPTION OF THE DRAWINGS

Further features of the invention, its nature, and various advantages will be more apparent from the following detailed description and the accompanying drawings, wherein like reference characters represent like elements throughout, and in which:

FIG. 1 is a schematic representation of a processor architecture in accordance with the principles of the present invention;

FIG. 2 is a schematic representation of another processor architecture in accordance with the principles of the present invention; and

FIG. 3 is a schematic representation of an exemplary linkage arrangement for the linkage of data from a variety of parallel data paths, in accordance with the principles of the present invention.

The following is an index of the reference numerals and labels that are used in FIGS. 1-3 to identify drawing elements.

REFERENCE NUMBERS AND LABELS INDEX

0 strip

1 group memory

2 processing unit

3 global communication unit

4 local linkage arrangement

5 global linkage arrangement

6 control means/latch

7 control means

8 global linkage arrangement

9 local data feedback

10 global data feedback

11 control line for global linkage arrangement

S1 first step of an addition network

S2 second step of an addition network

S3 third step of an addition network

Ri input register (i=0 . . . N)

RRi output register (i=0 . . . N)

Di data word (i=0 . . . 3)

VA full adder

HA half adder

C carry of a full adder

Gj j-th position of the sum (j=0 . . . 5)

DESCRIPTION OF THE INVENTION

Parallel processing methods and arrangements are provided for improving the speed of data processing. The parallel processing arrangements are configured with data linking arrangements so that processed data from individual data processing strips or processing units can be linked or brought together without requiring a great expenditure of time.

FIG. 1 shows a schematic representation of an exemplary processor arrangement designed in accordance with the principles of the present invention. The processor arrangement includes a plurality of parallel data processing strips (0). Further, the processor arrangement includes a group memory 1, in which data groups are capable of being stored under one address, where a single data group has a plurality of data elements. Processing units 2, each with an input register R₀ . . . R_(N) and an output register RR₀ . . . RR_(N), are arranged in a strip structure. In an alternate version of the inventive processor arrangement, the registers may be designed as a register set, which includes a plurality of input and output registers.

With continued reference to FIG. 1, a global linkage arrangement 5 is inserted after the output registers RR₀ to RR_(N). Global linkage arrangement 5 is designed to be a combinatorial addition network in the input step. Global linkage arrangement 5 is not clock controlled. Further, a global communication unit 3 may be disposed between processing units 2 and group memory 1. Data from group memory 1 may be fed to the respective processing units 2 via communication unit 3. Additionally or alternatively, it is possible that a data group or at least one element of the data group may be connectable directly with the assigned processing units, bypassing the communication unit. A data group is simultaneously read out of the data memory in parallel and distributed to a plurality of processing units 2 for processing in parallel. The processing units 2 in each instance include at least one process unit and one arithmetic logic unit (not shown).

In versions of the processor arrangement where the registers are designed as register sets, at least one input linkage logic and at least one output linkage logic, with which the registers of a register set are connectable within a data path, may additionally be arranged between a processing unit and the assigned input register set and output register set.

In operation, as previously described, each element of a data group from the group memory 1 may be sent either directly to the assigned processing unit or via communication unit 3 to be distributed to other processing units. The sent data reach the respective processing units 2 via input registers R₀ . . . R_(N). Then, the data results of processing units 2 are written to the respective output registers RR₀ . . . RR_(N). These result data may in turn be written directly in the group memory 1 or be distributed by means of the communication unit 3.

A local linkage arrangement 4 is disposed between adjacent processing units 2. Local linkage arrangement 4 may be utilized to link data from two adjacent processing units 2 in a combinatorial manner that is not clock-controlled. The linkage results may be written back to either one of the two processing units. In exemplary local linkage arrangement 4, the two data elements are XOR-linked via a combinatorial network, which is not clock-controlled. Accordingly, no additional clock cycle is necessary for ascertainment of the result, and the processing unit in which the result is further processed experiences no internal delay.

As explained above, all output registers RR₀ . . . RR_(N) are connected with the global linkage arrangement 5, in which the individual output data of all (N+1) processing units 2 are added. The addition network of the linkage arrangement 5 is represented in FIG. 3 for a four-strip processor arrangement where, for the sake of simplicity of representation, four-bit data words are added in the linkage arrangement.

FIG. 3 shows operation of the linkage arrangement in rows S1-S3. In FIG. 3, data word positions are labeled as Dij, where the index i identifies the data word (i.e., the strip) and j identifies the data word position. In a first step at row S1, the individual bits of three data words D0-D2 from the registers RR₀, RR₁ and RR₂ are added by means of four full adders VA. The results of each adder are given in a second step to an assigned full adder VA at row S2. Also, the carries C are given to the assigned full adders VA of the next position (e.g., in step S2). The respective bit of the fourth data word D3 from the register RR₃ of the fourth strip is also shown as being present in the four full adders VA at the second row S2. Since, in the first and second steps at rows S1 and S2 of the linkage arrangement, the carry C is not transferred within the same step to the full adders of the subsequent data word position, all calculations in both steps S1 and S2 can be performed simultaneously and immediately with the data fed to the inputs of the full adders. Only in the last step at row S3, which includes a half adder HA and three subsequent full adders VA, are the carries C of a lower data word position sent or transferred on to the full adder of the subsequent position. In this manner, the linkage in the arrangement shown in FIG. 3 can be accomplished with three carry transfers within the last partial step (S3), which is easily performed in a single clock cycle. As a result, a 6-bit data word G (G0-G5) is produced in a last step of the linking arrangement, by bringing together a carry and a sum vector. The higher positions of the result word G are filled with zeros for the formation of a data group (not shown) and may be fed via the control means 7 into global communication unit 3, from which the calculated data group is either stored in the group memory or distributed to the processing means.
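The three-row network described above can be modeled bit by bit. The sketch below is one plausible wiring that is consistent with the text (four full adders in S1, four in S2, one half adder plus three full adders in S3, and a 6-bit result); FIG. 3 itself is not reproduced here, so the exact routing of the carry from the highest position of S1 into S3 is an assumption, as are the helper names `full_adder`, `half_adder`, `bits` and `fig3_add`.

```python
def full_adder(a, b, c):
    """One-bit full adder (VA): returns (sum bit, carry bit)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def half_adder(a, b):
    """One-bit half adder (HA): returns (sum bit, carry bit)."""
    return a ^ b, a & b


def bits(word, width=4):
    """Split a word into its bits, least significant position first."""
    return [(word >> j) & 1 for j in range(width)]


def fig3_add(d0, d1, d2, d3):
    """Add four 4-bit words with a three-row network (assumed wiring)."""
    b0, b1, b2, b3 = (bits(d) for d in (d0, d1, d2, d3))

    # Row S1: four independent full adders compress D0, D1, D2.
    s1, c1 = zip(*(full_adder(b0[j], b1[j], b2[j]) for j in range(4)))

    # Row S2: four independent full adders take the S1 sum bit, the S1
    # carry from the next-lower position, and the bit of D3.
    s2, c2 = zip(*(full_adder(s1[j], c1[j - 1] if j else 0, b3[j])
                   for j in range(4)))

    # Row S3: the only row in which a carry ripples across positions.
    g = [s2[0]]                                  # G0 needs no adder
    g1, carry = half_adder(s2[1], c2[0])         # HA produces G1
    g.append(g1)
    for x, y in ((s2[2], c2[1]), (s2[3], c2[2]), (c1[3], c2[3])):
        gj, carry = full_adder(x, y, carry)      # three VA produce G2..G4
        g.append(gj)
    g.append(carry)                              # G5 is the last carry out

    return sum(bit << j for j, bit in enumerate(g))
```

In this model the carry moves between adders only three times in row S3 (from the half adder into the first full adder, and then twice between full adders), which matches the three transfers mentioned in the text.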

Since the first two steps at rows S1 and S2 are performed combinatorially and are not clock-controlled, each input of the global linkage arrangement 5 may have a controllable gate 6 in the form of a latch by which a change in the output registers RR₁ . . . RR_(N) can be fed into the global linkage arrangement 5. With this configuration of global linkage arrangement 5, a change in an output register of a processing unit is precluded from automatically causing global linkage arrangement 5 to link data, which is always associated with the consumption of energy. In this way, the linkage of data can be moved to such time at which the data brought together in the linkage arrangement are required or such time after all input data are present.
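The effect of the controllable gate can be illustrated with a small behavioral model. This is only a sketch under stated assumptions: the class name `GatedLinkage`, its attributes, and the use of an evaluation counter as a stand-in for switching activity are hypothetical and are not taken from the specification, and the adder network is abstracted to a plain sum.

```python
class GatedLinkage:
    """Behavioral model of a combinatorial sum network behind input latches."""

    def __init__(self, n_strips):
        self.latched = [0] * n_strips  # values currently visible to the network
        self.enable = False            # state of the controllable gate (latch)
        self.evaluations = 0           # hypothetical proxy for switching activity
        self.result = 0

    def drive(self, output_registers):
        """Model a change in the output registers of the processing units."""
        if not self.enable:
            # Gate closed: the change does not reach the network,
            # so nothing is recomputed and the old result is held.
            return self.result
        self.latched = list(output_registers)
        self.result = sum(self.latched)
        self.evaluations += 1
        return self.result


linkage = GatedLinkage(4)
linkage.drive([1, 2, 3, 4])    # gate closed: register changes are ignored
linkage.drive([5, 6, 7, 8])
linkage.enable = True          # all inputs present and the result is needed
value = linkage.drive([5, 6, 7, 8])
```

In this model the network is evaluated once, when the gate is opened, rather than on every intermediate change of the output registers.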

Other versions of the combinatorial linkage arrangement may include the provision of an additional XOR-linkage as a subtraction network or may comprise a shift arrangement or an inverter.

In an implementation of the invention, a processor arrangement includes a linkage arrangement designed for maximum formation over a plurality of strips. The arrangement has a plurality of calculation steps, in which the data of two strips are in each instance subtracted from one another. If the result is negative, the subtrahend is sent on to the next calculation step. If the result is positive, the minuend is sent on to the next calculation step. At the same time, an index is transmitted to the next calculation step, which indicates in which of the strips thus far considered the extreme lies. In a maximum formation over 8 strips, an index of 3 bits and 7 calculation steps are thus required. These may be processed cascade-like, but may alternatively be designed for at least partially processing in parallel.
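The cascade form of the maximum formation can be sketched as follows. The function name `max_with_index` is an illustrative assumption, and the handling of a zero difference (here the minuend is kept) is also an assumption, since the text only addresses negative and positive results.

```python
def max_with_index(values):
    """Cascade maximum formation that forwards the value and its strip index.

    Returns (maximum value, index of the strip holding it, number of
    subtract-and-select steps performed). Behavioral sketch only.
    """
    best_value, best_index = values[0], 0
    steps = 0
    for i in range(1, len(values)):
        diff = best_value - values[i]   # minuend minus subtrahend
        if diff < 0:
            # Negative result: the subtrahend is larger and is sent on,
            # together with the index of its strip.
            best_value, best_index = values[i], i
        # Otherwise the minuend and its index are sent on unchanged.
        steps += 1
    return best_value, best_index, steps


value, index, steps = max_with_index([3, 9, 2, 9, 5, 1, 7, 0])
```

For 8 strips this performs 7 subtract-and-select steps, and the forwarded index takes values from 0 to 7, so 3 bits suffice, in line with the figures given in the text.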

FIG. 2 shows a schematic representation of another exemplary processor arrangement designed in accordance with the principles of the present invention. In the processor arrangement of FIG. 2, a global linkage arrangement 8 is connected with the input registers R₀ . . . R_(N). The linkage arrangement 8 includes two separate and independent logic arrangements, either of which may be selected by means of the control line 11. The first logic arrangement produces a data group that is fed back via the global data feedback 10 into the global communication unit 3. In the second logic arrangement, on the other hand, a data element is produced which is fed back via the local data feedback 9 into the input register R₁ of the processing unit 2 of the second strip.

1. A method for bringing together data from parallel data paths in a processor arrangement, wherein data in a data memory are split and stored in data groups with a plurality of elements under one and the same address, read out from the latter and fed to processing units, wherein each element of a data group is assigned a processing unit and all elements of a data group are read out of the data memory simultaneously and in parallel and distributed as input data to the processing units and in the latter are processed clock-controlled in parallel, the method comprising the step of: directly linking output data from the processing units using at least one operation which is combinatorial and which is not clock-controlled, wherein the using at least one operation comprises using a redundant numerical representation in at least one partial step of the operation so that carries in all positions of the data are formed simultaneously and used for the next partial step; and feeding back a produced data element or data group result of the operation into at least one of the processing unit or into a register assigned to this processing unit.
2. A processor arrangement for parallel, clock-controlled data processing with a data memory designed as group memory and parallel processing units, which are connected together via a communication unit, wherein at least one data group with a plurality of elements is stored in the group memory under one address and each data element is assigned a processing unit, the processor arrangement comprising: at least one process unit and at least one memory unit, which are located in a strip that is adjacent to at least an additional strip of like structure; a programmed controlled combinatorial linkage arrangement comprising at least one of an addition network, a subtraction network, and a network for minimum/maximum formation, and in which network the carry in a specified linking step is ascertainable at a position of the data independent of the preceding positions, which is not clock-controlled and which directly links together data from the strips, wherein the data comprises output data from the processing units, wherein at least one data element or data group output register of the linkage arrangement is assigned a control latch by which the register data are separable from the linkage arrangement.
3. The method of claim 1 comprising linking data of all processing units.
4. The method of claim 1 further comprising producing a data element or a data group as the result of using the at least one operation.
5. The method of claim 1 wherein using at least one operation comprises selecting one of a plurality of combinatorial operations that are not clock-controlled operations.
6. The processor arrangement of claim 2 comprising a plurality of such linkage arrangements, wherein a selection of the linkage arrangements is program-controlled.
7. The processor arrangement of claim 2 wherein an output of the linkage arrangement is connected with a register of a processing unit or a register assigned to a data group.